Bulletin of Electrical Engineering and Informatics 
Vol. 9, No. 6, December 2020, pp. 2492~2498 
ISSN: 2302-9285, DOI: 10.11591/eei.v9i6.2018


Marketplace affiliates potential analysis using cosine similarity 
and vision-based page segmentation 


Wildan Budiawan Zulfikar, Mohamad Irfan, Muhammad Ghufron, Jumadi, Esa Firmansyah
Department of Informatics, UIN Sunan Gunung Djati, Indonesia
Department of ICT, Asia E University, Malaysia
School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia
Department of Informatics, STMIK Sumedang, Indonesia


Article Info

Article history:
Received Aug 15, 2019
Revised Jan 28, 2020
Accepted Mar 1, 2020

Keywords:
Cosine similarity
Marketplace affiliates
Page segmentation
Vision
Web scraping

ABSTRACT

One success factor of an online affiliate is the quality of its content source. Therefore, affiliate
marketplaces need an objective assessment when retrieving the content data used to choose the right
product in the appropriate product filter. Usually, the selection is not made with a sound, measurable
system, so product content is chosen subjectively and does not reflect what is actually observed.
An analysis based on a sound and measurable system, by contrast, produces objective product content and
can have a positive impact on users because the selection is based on factual data. The purpose of this
research is to analyze the potential of the affiliate marketplace by combining cosine similarity with
vision-based page segmentation. This is a new approach to optimization that aims to obtain the best content
according to the required criteria. This work produces a number of product recommendations that are
appropriate for publication and can then be used for comparison against the required criteria.
At the limited evaluation stage, the proposed model obtained satisfactory results: all 5 queries tested
behaved as expected.


This is an open access article under the CC BY-SA license. 


Corresponding Author:

Wildan Budiawan Zulfikar,
Department of Informatics,
UIN Sunan Gunung Djati,
105th A.H. Nasution Street, Bandung, 40614, Indonesia.
Email: wildan.b@uinsgd.ac.id


1. INTRODUCTION 

Nowadays, information technology has created new types of business and new business opportunities, with
more and more transactions being made online. As a result, anyone can easily carry out buying
and selling transactions [1-3]. Many companies try to offer a variety of products through this medium [4, 5].
One benefit of the internet is its use as a medium for product promotion. A product that
is marketed online can bring huge benefits to entrepreneurs because it becomes known throughout
the world [4, 6].

Web scraping is the process of extracting information from a website. It is chosen as an alternative
because the required data is not always available through an API or another source such as a shared
database or data warehouse, and some sites do not provide an API at all [7-12]. This research uses product
attribute data obtained from several marketplace affiliates using web scraping techniques, specifically
vision-based page segmentation. Vision-based page segmentation is an algorithm for processing
web page metadata. Based on previous research, this method of extracting tag tree data can detect content
structures quickly [13, 14]. It transforms the deep web into a visual tree [13, 15]. The result is divided into
several segments and can be processed using a DOM parser before it is finally processed and modeled [13].

In addition, the proposed model applies cosine similarity and TF-IDF. Cosine similarity is an
algorithm for comparing the similarity between documents; in this case, a query is compared
with a training document [16-18]. To calculate cosine similarity, first take the scalar (dot) product of
the query and document vectors, that is, multiply the corresponding term weights and sum the results.
Then compute the length of each vector as the square root of the sum of its squared weights, and multiply
the two lengths [16, 19-23]. Finally, divide the scalar product by the product of the document
and query lengths.
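The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the example vectors, and the choice to return 0 for an all-zero vector are assumptions made here.

```python
import math


def cosine_similarity(query, document):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(query) != len(document):
        raise ValueError("query and document vectors must have the same length")
    # Scalar (dot) product: multiply matching weights and add them up.
    dot = sum(q * d for q, d in zip(query, document))
    # Vector length: square root of the sum of squared weights.
    query_length = math.sqrt(sum(q * q for q in query))
    document_length = math.sqrt(sum(d * d for d in document))
    if query_length == 0 or document_length == 0:
        return 0.0  # assumption: an empty vector is treated as "not similar"
    return dot / (query_length * document_length)


# Hypothetical term weights for a query and one document.
example_query = [1, 1, 0]
example_document = [1, 0, 0]
score = cosine_similarity(example_query, example_document)
```

A score close to 1 means the query and document point in nearly the same direction in term space, and a score of 0 means they share no weighted terms.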


2. RESEARCH METHOD 
The attribute data used in this article are sourced from several marketplaces and taken directly from
the original websites. The online marketplace data collected are products that are still active in
the product category list. The marketplace affiliates used in this work are listed in Table 1.


Table 1. List of marketplace affiliates

Marketplace  URL                         Role
Tokopedia    https://www.tokopedia.com   Main marketplace
Bukalapak    https://www.bukalapak.com   2nd marketplace
Blanja       https://www.blanja.com      3rd marketplace
Lazada       https://www.lazada.co.id    4th marketplace


In this work, the main marketplace is Tokopedia, and each of its products is compared with products on
the other marketplaces. The method is divided into two processes. The first process scrapes
product data from all selected marketplace websites, using the category ID and category name
attributes of each product. During web scraping, product data are divided into several
categories, which are then processed with the cosine similarity and vision-based page segmentation methods.
Once the data are available as an HTML DOM, the system selects the data items used to process and
display the results. The category is one reference for this filtering process because it indicates
the level of each product and supports accurate retrieval and filtering by price and rating order.
Product availability is the second factor because it supports product competency.
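As a rough illustration of reading product attributes out of an HTML DOM, the sketch below uses Python's standard `html.parser` module. The markup, the class names (borrowed from the attribute codes used in this work), and the `ProductParser` helper are all hypothetical; real marketplace pages are structured differently, and this is not the scraper used by the authors.

```python
from html.parser import HTMLParser

# Hypothetical product markup; real marketplace pages will differ.
SAMPLE_HTML = """
<div class="product">
  <h1 class="prod_name">Samsung Galaxy S7</h1>
  <span class="prod_price">3500000</span>
  <span class="prod_cat_name">Smartphone and tablet</span>
</div>
"""


class ProductParser(HTMLParser):
    """Collects the text of elements whose class matches a wanted field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None   # field whose text we are waiting for
        self.product = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.fields:
                self.current = name

    def handle_data(self, data):
        # Store the first non-blank text that follows a matching start tag.
        if self.current and data.strip():
            self.product[self.current] = data.strip()
            self.current = None

    def handle_endtag(self, tag):
        self.current = None


parser = ProductParser(["prod_name", "prod_price", "prod_cat_name"])
parser.feed(SAMPLE_HTML)
product = parser.product
```

Keeping the wanted fields in a list makes it straightforward to filter by category first and then read price or rating from the same parsed record.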


2.1. Cosine similarity 

The following is an example of the data used in the process. The category data are shown in
Table 2; in this work they are limited to 6 categories. The product attributes analyzed in this work
are detailed in Table 3.


Table 2. Product categories

Categories             Code
Fashion                Cat_001
Health                 Cat_002
Beauty                 Cat_003
Smartphone and tablet  Cat_004
Laptop                 Cat_005
Computer               Cat_006


Table 3. Product attributes

Attributes            Code
Name                  prod_name
ID                    prod_id
SKU                   prod_sku
Link                  prod_link
Figure                prod_fig
Price                 prod_price
Category code         prod_cat_id
Category description  prod_cat_name
Advertiser            prod_ads


Table 4 shows the query data that will be calculated using TF-IDF for specific queries. This work
involves six queries and four affiliate marketplaces. Figure 1 visualizes Table 4 as a graph, which
makes it easier to see which queries have the same or nearly the same matches across the
marketplace list.


Table 4. TF-IDF

Query                 Marketplace 1 (D1)  Marketplace 2 (D2)  Marketplace 3 (D3)  Marketplace 4 (D4)
Galaxy s7             1                   1                   0                   0
Samsung               1                   1                   1                   1
iPhone X 128Gb Black  1                   0                   0                   0
Galaxy+S7             1                   1                   1                   1
MacBook Air           1                   0                   1                   1
Keyboard Razer        1                   0                   1                   0

Figure 1. TF-IDF data on graph


The first stage of vision-based page segmentation is to determine the initial weight of each query
manually. For example, the first weight is assigned to the Samsung query and the second weight to
the Galaxy query. This gives Centroid 1 = 0.3 and Centroid 2 = 0.3, as shown in Table 5 and visualized
in Figure 2.


Table 5. TF-IDF data and weighting

Code     D1   D2   Description
Oppo     0    0
Lenovo   0    0
Samsung  0.3  0    Weight 1
Asus     0    0
Galaxy   0    0.3  Weight 2
Iphone   0    0


Figure 2. TF-IDF data in a graph diagram with weighted queries (axis label: Documents 1 and 2)


Next, calculate the distance between each data point and each weight using (1) [24, 25]:


$\mathrm{idf} = \log \frac{N}{df}$ (1)
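The idf term in (1) can be illustrated with the data of Table 4. The sketch below assumes the usual definition, the logarithm of the number of documents N divided by the number of documents df that contain the term. Two further assumptions are made here and are not stated in the source: the logarithm is base 10, and each 1 in Table 4 means "this query was found in that marketplace".

```python
import math

# Marketplaces (D1..D4) in which each query was found, read from Table 4.
MATCHES = {
    "Galaxy s7": {"D1", "D2"},
    "Samsung": {"D1", "D2", "D3", "D4"},
    "iPhone X 128Gb Black": {"D1"},
    "Galaxy+S7": {"D1", "D2", "D3", "D4"},
    "MacBook Air": {"D1", "D3", "D4"},
    "Keyboard Razer": {"D1", "D3"},
}
TOTAL_DOCUMENTS = 4  # N: the four marketplaces


def idf(document_frequency, total_documents):
    """Inverse document frequency, log10(N / df); 0 when the term never occurs."""
    if document_frequency == 0:
        return 0.0  # assumption: avoid dividing by zero for unseen terms
    return math.log10(total_documents / document_frequency)


idf_by_query = {
    query: idf(len(found_in), TOTAL_DOCUMENTS)
    for query, found_in in MATCHES.items()
}
```

Under these assumptions a query found in every marketplace gets an idf of 0, while a query found in fewer marketplaces gets a larger value, so rarer matches carry more weight.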

The next step is to calculate the weights by comparing query data 1 with each query that has been given
a weight. The query data weighting can be seen in Table 6. Then, compute the average of each weight value
to be used as the new query weights, namely:


W1 New: (0.3, 0.3)
W2 New: (0.0, 0.0)


This step is repeated until the stopping condition is met, namely that the weighting of the data source
no longer changes, which means the query assignments are the same as in the previous iteration.
The second iteration is performed using the new weights. In this experiment, the algorithm finishes in
the third iteration. The final results are presented as a Cartesian diagram to make it easier to see how
close each data point is to each weight. Table 6 lists the queries in each group: Q3, Q5, and Q6 fall
under the first weight, and Q1, Q2, and Q4 under the second.


Table 6. Clustering results

W1  W2
Q3  Q1
Q5  Q2
Q6  Q4
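The iterative weighting described above (assign each data point to its nearest weight, average each group to get new weights, and stop when nothing changes) resembles a k-means style loop. The sketch below is one possible reading of it, not the authors' code: the use of Euclidean distance, the function names, and the sample points are assumptions made for illustration.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for every point (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """New centroid of each group: the average of its member points."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            updated.append(old)  # keep an empty group's centroid unchanged
            continue
        updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return updated


def cluster(points, centroids, max_iter=100):
    """Repeat assign/update until the centroids stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
        labels = assign(points, centroids)
    return labels, centroids


# Hypothetical (D1, D2) weights for four terms and two initial centroids.
sample_points = [(0.3, 0.0), (0.3, 0.1), (0.0, 0.3), (0.1, 0.3)]
initial_centroids = [(0.3, 0.0), (0.0, 0.3)]
labels, final_centroids = cluster(sample_points, initial_centroids)
```

The `max_iter` cap is a safety limit added here; the stopping rule that matters is the one in the text, namely that an iteration leaves the weights unchanged.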


2.2. Vision-based page segmentation 

The queries retrieved are adjusted to the query that has been selected. The higher the weight
of the selected query, the more that query is used; conversely, the lower the weight, the less
it is used. Next, the data are normalized according to the vision-based page
segmentation formula and then multiplied by the weights determined at the initialization stage,
using (2) for profit (benefit) criteria and (3) for cost criteria.


3. RESULTS AND DISCUSSION

After the weighting phase, a product data search is performed using
the vision-based page segmentation algorithm. From the data that have been classified, only one
product that matches the previous TF-IDF process is taken and compared using this
algorithm. The query group is selected according to the query being searched. If the query matches,
the corresponding group is taken; if it does not, the result is a 404 page or the product page
is not selected. In this case, the first query taken is Samsung Galaxy. The vision-based page
segmentation algorithm described in the previous section is applied as follows:
a. The first step is to determine the new product data. Figure 3 describes the default position of
the product details, including name, dimensions, price, and any related product data.





Figure 3. Design scheme of vision-based page segmentation for product data scraping

b. Then, each product attribute is parsed by its category and subcategory, as described in Figure 4.


Figure 4. Web process data page product data scraping 


c. Extract the data to be searched using (4).
TSV = w(1) * r1R (4)


where w(1) is the input query data with the total weight to be searched from the weight value
of R, and r is the number of segments related to the query that already have a value.
d. Validate the processed data and normalize it according to (5).


$\mathrm{TSV} = \frac{\min_i x_{ij}}{x_{ij}}$ (5)

The results of data normalization using (5) are shown in Table 7.





Table 7. The result of data normalization

No.  Code  Div.Id  Div.Attr  Div.Class  Cache        Alpha result
1    Q1    0.875   1         0.8        1            0.5
2    Q2    1       1         0.8        0.954545455  1
3    Q4    0.9375  0.875     1          0.909090909  1

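e. The normalization results are multiplied by the weights and summed to obtain the final
preference value using (6), where $w_j$ is the weight of criterion $j$ and $r_{ij}$ is the normalized
value of query $i$ on criterion $j$. The final preference results are shown in Table 8.


$V_i = \sum_{j=1}^{n} w_j r_{ij}$ (6)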


Table 8. Final preference results

No.  Code  Div.Id  Div.Attr  Div.Class  Cache        Alpha result  Result
     W     0.3     0.2       0.2        0.15         0.15
1    Q1    0.875   1         0.8        1            0.5           0.8475
2    Q2    1       1         0.8        0.954545455  1             0.953181818
3    Q4    0.9375  0.875     1          0.909090909  1             0.942613636


Based on Table 8, the most recommended query data is Q2. Q2 scores about 0.95, only about
0.01 points higher than Q4 (0.94), while Q1 scores 0.8475.
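The normalization and weighted-sum steps can be checked with a short Python sketch. The weights and normalized values below are copied from the tables in this section; the function names and the raw values passed to `normalize_cost` in the usage line are hypothetical, and the cost-type formula (smallest value divided by each value) is an assumed reading of (5).

```python
def normalize_cost(column):
    """Cost-type normalization: the smallest raw value divided by each raw value."""
    smallest = min(column)
    return [smallest / value for value in column]


def preference(weights, normalized_row):
    """Weighted sum of one query's normalized criteria values."""
    return sum(w * r for w, r in zip(weights, normalized_row))


# Criteria order: Div.Id, Div.Attr, Div.Class, Cache, Alpha result.
WEIGHTS = [0.3, 0.2, 0.2, 0.15, 0.15]
NORMALIZED = {
    "Q1": [0.875, 1, 0.8, 1, 0.5],
    "Q2": [1, 1, 0.8, 0.954545455, 1],
    "Q4": [0.9375, 0.875, 1, 0.909090909, 1],
}

scores = {code: preference(WEIGHTS, row) for code, row in NORMALIZED.items()}
best = max(scores, key=scores.get)

# Usage of the normalization helper on hypothetical raw values.
example_normalized = normalize_cost([8, 7])
```

Running the weighted sum on the normalized values reproduces the Result column of the final preference table and ranks Q2 first, ahead of Q4 and Q1.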


4. CONCLUSION 

In terms of performance, the proposed model produces appropriate results: all 5 queries tested
behaved as expected. The cosine similarity algorithm successfully improved the vision-based page
segmentation algorithm and was able to match product 1 to other products that were eligible to be selected
by searching the product data processed with TF-IDF. For further work, we suggest comparing this model with
different methods.